Adam (“adaptive moment estimation”) is a gradient-descent optimization algorithm. It was first proposed in Kingma and Ba (2014) and is discussed extensively, e.g. in Chaudhury (2024) and Ruder (2017).

Adam incorporates both momentum and per-parameter adaptive learning rates. It is generally the default choice for most commercial deep learning tasks.

Adam maintains two state vectors, $m_t$ and $v_t$, which track exponential moving averages of the gradient and of the squared gradient, and which estimate the first and second raw moments of the gradient, respectively:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

where $g_t$ is the gradient at step $t$ and $\beta_1, \beta_2 \in [0, 1)$ are the decay rates of the two averages.
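
A minimal sketch of these two updates in NumPy (the names `m`, `v`, `beta1`, `beta2` and the example gradient are assumptions for illustration, not taken from the text above):

```python
import numpy as np

beta1, beta2 = 0.9, 0.999          # typical decay rates for the two averages
grad = np.array([0.5, -1.2, 3.0])  # stand-in for the gradient g_t

m = np.zeros_like(grad)            # first-moment estimate, starts at zero
v = np.zeros_like(grad)            # second-moment estimate, starts at zero

# One step of the exponential moving averages:
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
```
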
Because $m_0$ and $v_0$ are initialized to zero, the state vectors are biased towards zero early in training. To overcome this, Adam uses bias-corrected versions of $m_t$ and $v_t$, defined as

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
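
For instance, with the common choice $\beta_1 = 0.9$ (an illustrative value, not fixed by the text above), the first moving average is only a tenth of the first gradient, and the correction factor exactly undoes that shrinkage:

$$m_1 = (1 - \beta_1)\, g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^{1}} = \frac{0.1\, g_1}{0.1} = g_1.$$
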
Finally, the new value of the parameters, $\theta_{t+1}$, is defined as

$$\theta_{t+1} = \theta_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\alpha$ is a learning rate (“step size”) and $\epsilon$ is a small constant to ensure no division by zero.
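
Putting the pieces together, here is a minimal, self-contained sketch of one Adam update in NumPy. The function name `adam_step`, the toy objective, and the training loop are assumptions made for illustration; the hyperparameter defaults ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the values suggested in Kingma and Ba (2014).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative sketch; t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad**2     # moving average of the squared gradient
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v

# Toy usage: f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # nudged toward the minimum at the origin
```

Because the update divides $\hat{m}_t$ element-wise by $\sqrt{\hat{v}_t}$, each parameter gets its own effective step size, which is the adaptive-learning-rate behaviour mentioned above.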